library(tidyverse)
library(caret)
library(nnet)
library(plotly)
library(randomForest)
In this analysis, we will examine the effects of water temperature on the number of unique fish species observed in the Great Barrier Reef from 1997 to 2011.
Originally, we wanted to examine a relationship between water temperature and coral cover. However, we quickly saw that, when visualized, there does not seem to be a linear relation and we did not want to fit a mis-specified model.
Merge between temperature and coral cover data sets
temp <- read_csv("cleaned_data/aims_temperatures.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## avg_water_temp_2.0m_flat_site = col_double(),
## avg_water_temp_9.0m_slope_site = col_double()
## )
coral_cover <- read_csv("cleaned_data/coral_cover.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## mean_live_coral_cover_percent = col_double(),
## lower_conf_int = col_double(),
## upper_conf_int = col_double(),
## conf_int_span = col_double()
## )
temp_coral_cover <- merge(x=temp, y=coral_cover, by="date")
temp_coral_cover %>%
ggplot(aes(avg_water_temp_2.0m_flat_site, mean_live_coral_cover_percent)) +
geom_point()
temp_coral_cover %>%
ggplot(aes(avg_water_temp_9.0m_slope_site, mean_live_coral_cover_percent)) +
geom_point()
As we can see, there is no linear relationship. While there are 2 or 3 clusters of coral cover percentage values, they seem to be pretty consistent across the range of water temperatures at 2.0m and 9.0 below sea level.
We also suspect that increase in water temperature could also affect diversity in sea life. So, we will examine the relationship between water temperature and the number of unique species of fish observed in the Great Barrier Ree. First, we can do some basic visualization by plotting the relationship between water temperatures at depth 2.0m and 9.0m with the number of unique fish species observedin the Great Barrier Reef.
fish <- read_csv("cleaned_data/fish_species_counts.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## num_of_species = col_double()
## )
fish_temp <- merge(x=temp, y=fish, by="date")
# water temp at 2.0m
fish_temp %>% ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) + geom_point()
# water temp at92.0m
fish_temp %>% ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) + geom_point()
There seems to be a bit of negative linear relationship going on, so we will fit a linear model examining the number of unique species discovered in relation to rising temperature.
Now, we can split our data set into train and test sets, using 0.6 to partition our data. Our outcome is the mean coral cover percentage and water temperature is our covariate. We will fit 2 linear regression models: one examining effect of water temperature at 2.0m and the other examining the effect of temperature at 9.0m.
train_index <- createDataPartition(y=fish_temp$num_of_species, times=1, p = 0.6, list=FALSE)
train_set <- fish_temp[train_index, ]
test_set <- fish_temp[-train_index, ]
Fit linear regression model:
fish_temp_2.0m <- lm(num_of_species ~ avg_water_temp_2.0m_flat_site, data=train_set)
summary(fish_temp_2.0m)
##
## Call:
## lm(formula = num_of_species ~ avg_water_temp_2.0m_flat_site,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.577 -9.236 0.413 9.609 38.877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.0168 16.2550 8.614 1.98e-15 ***
## avg_water_temp_2.0m_flat_site -2.8334 0.5914 -4.791 3.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.8 on 203 degrees of freedom
## Multiple R-squared: 0.1016, Adjusted R-squared: 0.09717
## F-statistic: 22.96 on 1 and 203 DF, p-value: 3.196e-06
fish_temp_9.0m <- lm(num_of_species ~ avg_water_temp_9.0m_slope_site, data=train_set)
summary(fish_temp_9.0m)
##
## Call:
## lm(formula = num_of_species ~ avg_water_temp_9.0m_slope_site,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.401 -8.931 0.906 9.523 39.400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142.5193 16.7272 8.520 3.59e-15 ***
## avg_water_temp_9.0m_slope_site -2.9379 0.6114 -4.805 3.00e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.8 on 203 degrees of freedom
## Multiple R-squared: 0.1021, Adjusted R-squared: 0.09771
## F-statistic: 23.09 on 1 and 203 DF, p-value: 3e-06
We see that the models are very similar in results. The coefficient with covariate 2.0m water temperature and 9.0 water temperature is -2.56 and -2.61, respectively.
Although we should not expect a major difference between how each of the two models performs, let’s compare them anyways to assess which water temperature depth is a better predictor of unique fish species observed in the Great Barrier Reef.
pred_2.0m <- predict(fish_temp_2.0m, test_set)
pred_9.0m <- predict(fish_temp_9.0m, test_set)
postResample(pred = pred_2.0m, obs = test_set$num_of_species)
## RMSE Rsquared MAE
## 15.73121376 0.03996665 11.87241580
postResample(pred = pred_9.0m, obs = test_set$num_of_species)
## RMSE Rsquared MAE
## 15.67723856 0.04565415 11.82100056
As expected, results are very similar.
We can assess this visually to confirm our results.
# water temp at 2.0m
test_set %>%
ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) +
geom_point() +
geom_abline(intercept=fish_temp_2.0m$coefficients[1], slope=fish_temp_2.0m$coefficients[2], col="red")
# water temp at 9.0m
test_set %>%
ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) +
geom_point() +
geom_abline(intercept=fish_temp_9.0m$coefficients[1], slope=fish_temp_9.0m$coefficients[2], col="blue")
They both perform very similarly, and choosing either water temperature as our predictor will yield similar results.
Continuing on from sea life diversity, we have data on presence or absence of certain seagrass species in the Great Barrier Reef from 1999 to 2003. Let’s try to build a classifier to determine how location, and types of sediment and seabed may predict presence of certain sea grass.
As written in the data wrangling and cleaning RMarkdown/HTML file, we chose 4 species to classify: Cymodocea Serrulata, Syringodium Isoetifolium, Thalassia Hemprichii, and Zostera Muelleri (subspecies Capricorni).
seagrass <- read.csv("cleaned_data/seagrass_classification_data.csv", as.is =TRUE)
seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)
head(seagrass)
summary(seagrass)
## SPECIES LATITUDE LONGITUDE DEPTH
## C_SERRULAT: 1256 Min. :-24.20 Min. :142.5 Min. : 0.0000
## S_ISOETIFO: 101 1st Qu.:-23.79 1st Qu.:146.8 1st Qu.: 0.0000
## T_HEMPRICH: 823 Median :-22.41 Median :150.5 Median : 0.0000
## Z_CAPIRCOR:10201 Mean :-21.06 Mean :149.0 Mean : 0.9253
## 3rd Qu.:-19.17 3rd Qu.:151.3 3rd Qu.: 1.2279
## Max. :-10.75 Max. :151.9 Max. :29.3583
##
## SEDIMENT TIDAL
## Gravel: 1 intertidal:7433
## Mud :8969 subtidal :4948
## Reef : 63
## Rock : 66
## Rubble: 107
## Sand :3151
## Shell : 24
First we plot sea grass according to location (latitude and longitude). Then we will add a third axis (depth) to visualize this in 3-dimensions using plotly package. Since depth is measured in meters below sea level, we visualize this in negative values.
seagrass %>% ggplot() + geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES))
plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
We can also see how our categorical predictors relate to sea grass species.
seagrass %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
seagrass %>%
ggplot(aes(TIDAL, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
We will first use random forest to build a classifier and then use a multinomial logistic regression model, and compare the two.
Let’s first partition our data set into train and test sets. Since we have a lot more data here than in the linear regression model, we will partition it by 75%-25%.
# test-train split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)
train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL,
data=train_set,
mtry = 2)
rf_fit
##
## Call:
## randomForest(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL, data = train_set, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.61%
## Confusion matrix:
## C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR class.error
## C_SERRULAT 712 15 10 205 0.24416136
## S_ISOETIFO 32 28 5 11 0.63157895
## T_HEMPRICH 45 2 558 13 0.09708738
## Z_CAPIRCOR 80 5 5 7561 0.01176317
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
##
## true
## pred C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 227 12 18 22
## S_ISOETIFO 3 7 0 1
## T_HEMPRICH 6 1 185 1
## Z_CAPIRCOR 78 5 2 2526
##
## Overall Statistics
##
## Accuracy : 0.9518
## 95% CI : (0.9437, 0.9591)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8346
##
## Mcnemar's Test P-Value : 2.089e-08
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.72293 0.280000 0.90244
## Specificity 0.98129 0.998697 0.99723
## Pos Pred Value 0.81362 0.636364 0.95855
## Neg Pred Value 0.96909 0.994162 0.99311
## Prevalence 0.10149 0.008080 0.06626
## Detection Rate 0.07337 0.002262 0.05979
## Detection Prevalence 0.09017 0.003555 0.06238
## Balanced Accuracy 0.85211 0.639348 0.94983
## Class: Z_CAPIRCOR
## Sensitivity 0.9906
## Specificity 0.8438
## Pos Pred Value 0.9674
## Neg Pred Value 0.9503
## Prevalence 0.8242
## Detection Rate 0.8164
## Detection Prevalence 0.8439
## Balanced Accuracy 0.9172
We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPRICOR, which have 85%+ sensitivity and specificity. However, we got quite a low sensitivity for S_ISOETIFO. Recall to our data wrangling portion that S_ISOETIFO had only about 100 “Yes” observations. Since we had a small sample size for S_ISOETIFO relative to the other 3 seagrass species, this may have contributed to the low sensitivity.
We can visually assess our predicted values with true values of species to see how our model performed.
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")
For sediment type:
test_set %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(SEDIMENT, fill=rf_pred)) + geom_bar(width=.5, position = "dodge")
For seabed type:
test_set %>%
ggplot(aes(TIDAL, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=rf_pred)) + geom_bar(width=.5, position = "dodge")
Let’s examine variable importance.
variable_importance <- importance(rf_fit)
tmp <- data_frame(feature = rownames(variable_importance),
Gini = variable_importance[,1]) %>%
arrange(desc(Gini))
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
tmp
We see that longitude and latitude were very predictive of presence of seagrass followed by depth from sea level. The types of sediment and seabed (intertidal or subtidal seabed) are not very good predictors. Thus, it seems that the location of where the sea grass was discovered matters more than the various ocean floor properties.
tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) +
geom_bar(stat='identity') +
coord_flip() + xlab("Feature") +
theme(axis.text=element_text(size=8))
We can now try a multinomial logistic regression model to see how it compares to random forest. We will use the nnet package.
The logistic regression model is as follows:
multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights: 44 (30 variable)
## initial value 12874.515732
## iter 10 value 3231.438784
## iter 20 value 2579.045689
## iter 30 value 2488.644590
## iter 40 value 2418.828498
## iter 50 value 2403.582568
## iter 60 value 2377.804660
## iter 70 value 2377.669690
## iter 80 value 2377.612183
## iter 90 value 2375.572351
## iter 100 value 2375.518392
## final value 2375.518392
## stopped after 100 iterations
summary(multinom_fit)
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT,
## data = train_set)
##
## Coefficients:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud
## S_ISOETIFO -346.4807 2.2876720 3.011165 -0.001924993 -56.98242
## T_HEMPRICH -137.8337 1.9013715 1.334503 -0.320212571 -25.95114
## Z_CAPIRCOR -175.5349 0.6886591 1.486399 -0.830635230 -27.10451
## SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO -55.78712 -56.71262 -65.63158 -56.06705 -55.29990
## T_HEMPRICH -22.69174 -21.99150 -21.58262 -22.94339 -22.67330
## Z_CAPIRCOR -30.90690 -29.86489 -31.21765 -28.01992 -28.42101
##
## Std. Errors:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud
## S_ISOETIFO 0.002745115 0.05727101 0.007931424 0.02544938 0.4149306
## T_HEMPRICH 0.001313935 0.05666575 0.007251568 0.04008419 0.2994008
## Z_CAPIRCOR 0.002608711 0.02894079 0.004613846 0.02963196 0.3500373
## SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 0.7324316 0.9353767 2.028653e-05 0.3841211 0.9586960
## T_HEMPRICH 0.5047530 0.6375312 4.731509e-01 0.2702049 0.9276766
## Z_CAPIRCOR 0.9410291 0.7697505 1.083037e+00 0.3521422 0.8287129
##
## Residual Deviance: 4751.037
## AIC: 4805.037
Relative risk ratios where reference group is C_SERRULAT
exp(coef(multinom_fit))
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud SEDIMENTReef
## S_ISOETIFO 3.352302e-151 9.851976 20.311041 0.9980769 1.789980e-25 5.915087e-25
## T_HEMPRICH 1.379076e-60 6.695071 3.798108 0.7259947 5.364941e-12 1.396699e-10
## Z_CAPIRCOR 5.836702e-77 1.991044 4.421147 0.4357724 1.693021e-12 3.778359e-14
## SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 2.344337e-25 3.137364e-29 4.470860e-25 9.628484e-25
## T_HEMPRICH 2.813269e-10 4.234346e-10 1.085953e-10 1.422695e-10
## Z_CAPIRCOR 1.071129e-13 2.769142e-14 6.778028e-13 4.538492e-13
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")
# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")
confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
##
## Reference
## Prediction C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 126 12 11 38
## S_ISOETIFO 1 1 0 0
## T_HEMPRICH 15 2 186 13
## Z_CAPIRCOR 172 10 8 2499
##
## Overall Statistics
##
## Accuracy : 0.9089
## 95% CI : (0.8982, 0.9188)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6661
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.40127 0.0400000 0.90732
## Specificity 0.97806 0.9996742 0.98962
## Pos Pred Value 0.67380 0.5000000 0.86111
## Neg Pred Value 0.93533 0.9922380 0.99340
## Prevalence 0.10149 0.0080802 0.06626
## Detection Rate 0.04072 0.0003232 0.06012
## Detection Prevalence 0.06044 0.0006464 0.06981
## Balanced Accuracy 0.68967 0.5198371 0.94847
## Class: Z_CAPIRCOR
## Sensitivity 0.9800
## Specificity 0.6507
## Pos Pred Value 0.9293
## Neg Pred Value 0.8741
## Prevalence 0.8242
## Detection Rate 0.8077
## Detection Prevalence 0.8691
## Balanced Accuracy 0.8154
We see that our multinomial logistic model has about 91% overall accuracy, which performs a bit worse than random forest. However, the model performs very badly in predicting for S_ISOETIFO.
The model seems to predict T_HEMPRICH the best with 88.7% sensitivity and 98.8% specificity. The model also does not perform well for sensitivity of C_SERRULAT, with only about 41.7% sensitivity.
Again, we can assess our results visually:
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~predicted_class, type="scatter3d", mode="markers")
For sediment type:
test_set %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(SEDIMENT, fill=predicted_class)) + geom_bar(width=.5, position = "dodge")
For seabed type:
test_set %>%
ggplot(aes(TIDAL, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=predicted_class)) + geom_bar(width=.5, position = "dodge")
We can confirm in our visualizations that the logistic regression model failed to predict for S_ISOETIFO.
So, we see that the overall accuracy for multinomial logistic regression model vs random forest model was 95.6% and 90.9%, respectively. However, the overall accuracy stated for the logistic regression model is deceiving since it did not perform well in sensitivity in 2 out of 4 classes.